Study of the Relationship of Training Set Size to Error Rate in Yet Another Decision Tree and Random Forest Algorithms

Author

  • Susan Mengel
Abstract

ACKNOWLEDGMENTS

I thank Almighty God for providing the strength and knowledge to pursue this research. I would like to express my sincere gratitude to Dr. Susan Mengel, the chairperson of my committee, for guiding me through the research. In spite of her busy schedule, she was very helpful in solving problems and offered new insights. It was a pleasure to work alongside her, which has helped shape my academic objectives. I am glad to have Dr. Yu Zhuang as my committee member. He provided me with immense moral support during the entire course of my research, and I would like to thank him for his full cooperation and support toward the successful completion of my thesis research. I would like to thank Ms. Colette Solpietro and the other members of the Office of Research Services for their support and confidence in me. Finally, I would like to thank the most important people in my life, my parents and my brother. I am here because of their immeasurable confidence in me and the sacrifices they have made for me. I hope this work stands up to the high standards you have always expected of me.

ABSTRACT

Classification algorithms are among the most widely used data mining techniques for prediction. Among the different types, the decision tree is a predictive classification model with significant advantages over other techniques: it is easy to interpret, quick to construct, highly accurate, and light on resources. A decision tree model can be built with algorithms such as C4.5, CART, YaDT, and Random Forest, whose performance is measured by their error rates. This thesis research studies the relationship of training data size to error rate for the YaDT and Random Forest algorithms, and compares the performance of both with the results for C4.5 and CART. The research supports several conclusions. For example, the 66.7:33.3 splitting ratio that is well accepted in the literature can be increased to 80:20 for large data sets with more than 1000 samples to generate more accurate decision tree models. The stability of all algorithms in the research is weak beyond a 90:10 ratio because very little data remains for testing. The research reveals that while YaDT performs similarly to C4.5 and CART, Random Forest performs significantly better than the other three. Model performance can be determined most reliably with large data sets.

Similar resources

Application of Data Mining Algorithms in Discriminating Sediment Sources of the Nodeh Watershed, Gonabad

Introduction: Reduction of sediment supply requires the implementation of soil conservation and sediment control programs in the form of watershed management plans. Sediment control programs require identifying the relative importance of sediment sources, their quantitative ascription and identification of critical areas within the watersheds. The sediment source ascription involves two...

Application of ensemble learning techniques to model the atmospheric concentration of SO2

In view of pollution prediction modeling, the study adopts homogeneous (random forest, bagging, and additive regression) and heterogeneous (voting) ensemble classifiers to predict the atmospheric concentration of sulphur dioxide. For model validation, results were compared against widely known single base classifiers such as support vector machine, multilayer perceptron, linear regression and re...

Personal Credit Score Prediction using Data Mining Algorithms (Case Study: Bank Customers)

Knowledge and information extraction from data is an age-old concept in scientific studies. In industrial decision-making processes, the application of this concept gives rise to data-mining opportunities. Personal credit scoring is an ever-vital tool for banking systems in order to manage and minimize the inherent risks of the financial sector, thus, the design and improvement of credit scorin...

Decision Tree Studies for Estimating Breast Cancer Risk Using Single Nucleotide Polymorphisms

Introduction: Decision trees are data mining tools, widely used in computational biology and bioinformatics, for collecting, accurately predicting, and sifting information from massive amounts of data. In bioinformatics, they can be used to predict diseases, including breast cancer. The use of genomic data including single nucleotide polymorphisms is a very important ...

Determining Factors Influencing Length of Stay and Predicting Length of Stay Using Data Mining in the General Surgery Department

Background: Length of stay is one of the most important indicators in assessing hospital performance. A shorter stay can reduce the costs per discharge and shift care from inpatient to less expensive post-acute settings. It can lead to a greater readmission rate, better resource management, and more efficient services. Objective: This study aimed to ident...

Publication date: 2006